| Due date | Turn in |
|---|---|
| 11:45pm Mon Sep 9 | Assignment 3 Repo on GitHub has been created |
| 11:45pm Mon Sep 16 | Final solutions available on repo |
ETC5521 Diving Deeper into Data Exploration: Assignment 3
As per Monash’s integrity rules, these solutions are not to be shared beyond this class.
🎯 Goal
The assignment is designed to assess your knowledge of the foundation of EDA, distinctions between EDA and IDA, and ability to construct null samples for a particular problem. The assignment represents 20% of your final grade for ETC5521. This is an individual assignment.
📌 Guidelines
- Accept the GitHub Classroom Assignment provided in Moodle using a GitHub Classroom compatible web browser. This should generate a private GitHub repository that can be found at https://github.com/etc5521-2024. Your GitHub assignment 3 repo should contain the files assign03.html, README.md, assign03-submission.qmd, assignment.css, etc5521-assignment3.Rproj, .gitignore, and any data files generated from your work that are needed to render your solution file. Code should be available but folded in the report.
- Answer each question in assign03-submission.qmd in the repo.
- For the final submission, knit assign03-submission.qmd, which will contain your answers. Make sure to provide the link to the script of any Generative AI conversation you employed in arriving at your solution. Note that marks are allocated for overall grammar and structure of your final report.
- Leave all of your files in your GitHub repo for marking. We will check your git commit history. You should have contributions to the repo with consistent commits over time. (Note: nothing needs to be submitted to Moodle.)
- You are expected to develop your solutions by yourself, without discussing any details with other class members or other friends or contacts. You can ask for clarifications from the teaching team, and we encourage you to attend consultations to get assistance as needed. As a Monash student you are expected to adhere to Monash’s academic integrity policy and to the details on the use of Generative AI given in this unit’s Moodle assessment overview.
Deadlines: see the table at the top of this page.
🛠️ Exercises
Question 1: Is it really there?
For each of the following plot descriptions, write out the null hypothesis being tested, and explain how you would generate null samples.
a.
library(vcd)   # doubledecker()
library(grid)  # gpar()
doubledecker(xtabs(n ~ Dept + Gender + Admit, data = ucba),
  gp = gpar(fill = c("grey90", "orangered"))
)

b.
library(ggplot2)
ggplot(landmine, aes(x, y)) +
  geom_point(alpha = 0.6)
Question 2: Can you detect landmine locations?
The data for 1b is available in the file landmine3.csv. It represents an image taken over a field in an attempt to discover the location of landmines. The purpose is to clean the field by safely removing landmines.
- What alternative plots might be made for the data that might help to discover the landmine locations? Plot your data with several of these choices of displays.
- Can you see any potential locations of landmines? Explain.
- Conduct a lineup experiment with your choice of plot. The steps to doing a lineup experiment are:
  - Construct a lineup of your choice of plot.
  - Show your lineup to 8 friends, individually, who are not taking this unit, and ask each of them to choose the most different plot and to explain why they made that choice.
  - Compute and report the \(p\)-value, and summarise the reasons that your friends gave. (You need to show the lineup to each friend individually so that you get an independent evaluation of the plot.)
NOTE: The data was simulated and contains structure of interest: four small “holes”, one in each corner, which one would interpret as potential landmine locations, and a text string that says “you found me” (rotated and reversed) in the middle, which shows up as a region of high point density.
- The focus is on density rather than a linear relationship, so alternative plot choices should include density plots.
library(readr)      # read_csv()
library(ggplot2)
library(patchwork)  # plot composition with + and plot_layout()

landmine <- read_csv("data/landmine3.csv")

# Scatterplot with heavy transparency so dense regions appear darker
p1 <- ggplot(landmine, aes(x, y)) +
  geom_point(alpha = 0.1) +
  theme(axis.title = element_blank(),
        axis.text = element_blank())

# Scatterplot overlaid with 2D density contours
p2 <- ggplot(landmine, aes(x, y)) +
  geom_point(alpha = 0.1) +
  geom_density_2d(bins = 50, colour = "orange") +
  theme(axis.title = element_blank(),
        axis.text = element_blank())

# Filled 2D density plot
p3 <- ggplot(landmine, aes(x, y)) +
  geom_density_2d_filled(bins = 50) +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text = element_blank())

p1 + p2 + p3 + plot_layout(ncol = 3)

Ideally one sees both the small blank spots, although this is hard, and the high density of points in the middle corresponding to the text.
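As a rough numerical complement to the density displays, one could also count points in a grid of cells and look for cells that are unusually empty. The sketch below uses simulated points rather than landmine3.csv; the grid size, the hole location, and all object names are assumptions made purely for illustration.

```r
# Simulated stand-in for the landmine data: uniform points with one square "hole".
# The hole location and the 10 x 10 grid are assumptions chosen for illustration.
set.seed(5521)
breaks <- seq(0, 1, by = 0.1)
pts <- data.frame(x = runif(5000), y = runif(5000))

# Remove every point inside the square spanning breaks[2]..breaks[3] on both axes
in_hole <- pts$x >= breaks[2] & pts$x <= breaks[3] &
  pts$y >= breaks[2] & pts$y <= breaks[3]
pts <- pts[!in_hole, ]

# Count the points falling in each grid cell
counts <- table(
  x_bin = cut(pts$x, breaks, include.lowest = TRUE),
  y_bin = cut(pts$y, breaks, include.lowest = TRUE)
)

# Cells with the fewest points are candidate "holes"
sparsest <- which(counts == min(counts), arr.ind = TRUE)
print(sparsest)
```

With real data the holes will rarely align with grid cells, so a binned count like this is only a coarse check alongside the plots.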
You can generate a lineup with code like
library(nullabor)  # lineup(), null_permute()
ggplot(lineup(null_permute("x"), landmine), aes(x, y)) +
  geom_point(alpha = 0.1) +
  facet_wrap(~.sample, ncol = 5) +
  theme(axis.title = element_blank(),
        axis.text = element_blank())

The lineup for your choice of plot design is shown to each of your 8 friends, and you record the number of times the data plot is detected.
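To make concrete what permuting x does when generating a null plot, here is a minimal base R sketch of a single permutation. It uses simulated data with a deliberately strong association, not the landmine data, and the object names are illustrative only.

```r
# Simulated stand-in data with a strong x-y association (an assumption for illustration)
set.seed(2024)
d <- data.frame(x = rnorm(200))
d$y <- 2 * d$x + rnorm(200)

# One null sample: shuffle x and leave y untouched
null_sample <- d
null_sample$x <- sample(d$x)

# The marginal distributions are unchanged, but the x-y association is broken
cor_observed <- cor(d$x, d$y)
cor_null <- cor(null_sample$x, null_sample$y)
c(observed = cor_observed, null = cor_null)
```

Because only the pairing of x and y values changes, any structure that depends on their joint arrangement should be absent from the null plots.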
Use the pvisual() function to compute the \(p\)-value, as follows:
pvisual(4, 8, 20)
     x simulated   binom
[1,] 4    0.0013 0.00037
(This calculation is for 4 of the 8 friends detecting the data plot in a lineup of 20 plots.)
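The binom value in that output can be checked by hand with base R, assuming each of the 8 observers independently has a 1 in 20 chance of picking the data plot under the null. This is only a sketch of the binomial calculation; the simulated column comes from a simulation inside pvisual() and is not reproduced here.

```r
# Assumed scenario from the text: 4 of K = 8 observers pick the data plot
# in a lineup of m = 20 plots
x <- 4
K <- 8
m <- 20

# P(X >= x) when X ~ Binomial(K, 1/m)
p_binom <- pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)
round(p_binom, 5)
```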
Potential reasons might be the high density in the middle. It would be nice if the choice of plot also revealed the four empty holes in the corners, but this might be harder to spot.
Marking:
- 1 point for suitable plot designs, with explanation justifying choice
- 1 point for spotting the high-density region, and 1 point for the four holes. BONUS: 1 point for recognising the words, too.
- 1 point for study: summarising results from 8 friends (0.5), and computing the \(p\)-value appropriately (0.5)
Question 3: Exploring the relationships in availability of clean fuel and import/exports of fuel.
For the WDI data from assignment 2, restricted to 2022 (use the version created in the solution, wdi_valid.csv), use the lists of features for one- and two-variable distributions to summarise:
- The distribution of each of the variables, and
- The relationship between each of the pairs of variables.
Next,
- Decide on which variables to transform, and examine the before and after patterns.
- Write down three things that you would expect to see in this data, e.g. fuel imports and exports should be negatively related.
- What are three things that you find to be most surprising, or unexpected, in the data? (These do not all need to be related to the expectations you wrote down in the previous item.)
The original website has more information about the variables (indicators).
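When deciding which variables to transform, one way to compare the before and after patterns is to look at skewness and histograms side by side. The following is a minimal sketch on a simulated right-skewed variable, not on wdi_valid.csv; the variable name, the choice of a log transformation, and the helper function are all assumptions for illustration.

```r
# Simulated right-skewed variable standing in for a WDI indicator (hypothetical values)
set.seed(2022)
indicator <- rlnorm(200, meanlog = 2, sdlog = 1)

# Simple moment-based skewness, written out to avoid extra packages
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_before <- skewness(indicator)
skew_after <- skewness(log(indicator))
c(before = skew_before, after = skew_after)

# Side-by-side histograms for the visual before/after comparison
op <- par(mfrow = c(1, 2))
hist(indicator, main = "Raw", xlab = "value")
hist(log(indicator), main = "Log-transformed", xlab = "log(value)")
par(op)
```

A log transformation only applies to strictly positive values, so indicators containing zeros or negative values would need a different choice.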
Question 4: Predicting the winner
The next US presidential election will be held Nov 5, 2024. There are many polls being routinely conducted collecting preferences of the potentially voting public. The data provided in the file polls_Sep1_2024.csv contains polls for the popular vote collated by fivethirtyeight.com. It has been cleaned and re-organised.
- The variable population has several categories. Explain what each of these means, and how results based on different types might be expected to be different.
- Make your choice of plots to question whether pollsters are operating impartially, or whether they are biased. Explain what you find from this data.
Generative AI analysis
In this part, we would like you to actively discuss how generative AI helped with your answers to the assignment questions, and where or how it was mistaken or misleading.
You need to provide a link that makes the full script of your conversation with any generative AI tool accessible to the teaching staff. You should not use a paid service, as the freely available systems will be sufficiently helpful.
Marks
| Part | Points |
|---|---|
| Q1 | 4 |
| Q2 | 4 |
| Q3 | 7 |
| Q4 | 5 |
| GitHub Repo | -2 |
| Generative AI Analysis | -3 |
| Formatting, Spelling & Grammar | -3 |
Note that the negative marks for “GitHub Repo”, “Generative AI Analysis” and “Formatting, Spelling & Grammar” correspond to reductions in scores. You can lose up to 3 marks for poor use of generative AI, for example: no use, basic questions only, no link to the script, or no acknowledgment when it was clearly used. You can lose up to 3 marks for poorly formatted and written answers. Two marks will be deducted if you have NOT accepted the assignment and created your own repo by 11:45pm Mon Sep 9, and up to two marks for insufficient GitHub commits.